-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add documentation for dataset factories feature #2670
Conversation
Signed-off-by: Ankita Katiyar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start for the docs!
Some comments: I think we should refer to the feature as "dataset factories" instead of "dataset factory patterns". IMO the complete dataset block, e.g.:
"{namespace}.{dataset_name}#csv":
type: pandas.CSVDataSet
filepath: data/{namespace}/{dataset_name}.csv
is the dataset factory and the name: "{namespace}.{dataset_name}#csv"
is the pattern.
Several users mentioned in the interview that they would like to see more examples, specifically how the patterns are referenced in the pipeline. The data catalog page is super long and I know @stichbury is keen to restructure it, so I can imagine dataset factories will perhaps get a separate page, but for now I think it's important to include lots of examples even if that increases this page more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, with the suggestions. I made a few edits to @merelcht's to add a few more minor fixes but they're all good and this is such a great feature. I think we could do a mini blog post about this with some examples, maybe? WDYT?
Co-authored-by: Merel Theisen <[email protected]> Co-authored-by: Jo Stichbury <[email protected]>
A blog post would be great! I think we should also do a show and tell session for our users with this feature to drive adoption. There was so much interest for the interviews so I can imagine lots of people would like to know about this. |
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@merelcht & @stichbury Thanks for the feedback! I've incorporated all the edits. I've also added more examples and structured them under separate subheadings. Would you mind taking another look and letting me know if I've missed anything? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great now! 👍 Left some small edits and suggestions.
Co-authored-by: Merel Theisen <[email protected]>
Actually one more addition after having gone through all the interviews again: an example of how to use namespaces. E.g. a pattern that's "{namespace}.{dataset_name}#csv" and then in the pipeline they wouldn't have to explicitly add the namespace because Kedro adds that. |
Signed-off-by: Ankita Katiyar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @ankatiyar! Just added one nitpick, but approved anyway
``` | ||
The datasets in this catalog can be generalised to the following dataset factory: | ||
```yaml | ||
"{name}_data": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is the first example, I'm wondering if it would be good to add an "info" admonition saying
"If you don't use a suffix the default dataset will be overridden, see [this section below]"
Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the first example basic enough to get started but the second example actually shows a situation where you might need to add a suffix/prefix. I added a small explanation there as to why.
Signed-off-by: Ankita Katiyar <[email protected]>
``` | ||
These datasets can be combined into the following dataset factory: | ||
```yaml | ||
"{dataset_name}#csv": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is #
a convention we recommend? I know it doesn't matter what character it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah maybe something else like @
, $
, %
? (Sorry if this has been treated already elsewhere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not something we recommend afaik, I've tried to use different delimiters for each example, underscore, hash symbol, period etc
docs/source/data/data_catalog.md
Outdated
shuttles: | ||
type: pandas.CSVDataSet | ||
filepath: data/01_raw/shuttles.csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is just an example, but it reuses a lot of spaceflights
tutorial, if I create a spaceflights project and do this I get error because I only have shuttles.xlsx
but not shuttles.csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I changed it to be csv just for the example but you are right, it might cause confusion. Changed it!
Signed-off-by: Ankita Katiyar <[email protected]>
Description
Resolve #2666
To be merged with/after #2635
Development notes
Documentation for the dataset factories feature to the "The Data Catalog" page.
Checklist
RELEASE.md
file